Business Insight from Collection of Unstructured Formatted Documents with IBM Content Harvester

Authors

  • Biplav Srivastava
  • Yuan-Chi Chang
Abstract

In this paper, we report on the development of, and experiments with, IBM Content Harvester (CH), a tool to analyze and recover templates and content from text documents created with word processors. CH is part of a larger effort to collect and reuse material generated in business service engagements. Specifically, it operates on unstructured formatted documents by extracting content, cleansing it of sensitive information, tagging it with user-defined or domain-defined labels, and making it available for publishing in any open format and for flexible querying. As a result, one can search for specific information based on tags, aggregate information regardless of document source or formatting peculiarities, and publish the content in any format or template. CH has been applied to a broad variety of document collections containing hundreds of documents, including live engagements, with promising results.
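The stages named in the abstract (extract, cleanse, tag, query) can be illustrated with a minimal sketch. This is not CH's implementation, which the abstract does not describe. The redaction patterns, the keyword-based labels, and all names (`Fragment`, `extract`, `cleanse`, `tag`, `harvest`, `query`) are hypothetical, and plain-text input stands in for word processor files.

```python
import re
from dataclasses import dataclass, field

# Hypothetical patterns for sensitive data; CH's real cleansing rules are not given.
SENSITIVE = [
    re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),  # email addresses
    re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),         # US-style phone numbers
]

# Hypothetical user-defined labels: tag name -> keywords that trigger it.
LABELS = {
    "risk": ("risk", "mitigation"),
    "schedule": ("deadline", "milestone"),
}


@dataclass
class Fragment:
    source: str                      # document the fragment came from
    text: str                        # cleansed paragraph text
    tags: set = field(default_factory=set)


def extract(source, raw):
    """Split a document's text into paragraph fragments (blank-line separated)."""
    return [Fragment(source, p.strip()) for p in re.split(r"\n\s*\n", raw) if p.strip()]


def cleanse(fragment):
    """Replace anything matching a sensitive pattern with a placeholder."""
    for pattern in SENSITIVE:
        fragment.text = pattern.sub("[REDACTED]", fragment.text)
    return fragment


def tag(fragment):
    """Attach every label whose keywords appear in the fragment."""
    lowered = fragment.text.lower()
    for label, keywords in LABELS.items():
        if any(keyword in lowered for keyword in keywords):
            fragment.tags.add(label)
    return fragment


def harvest(documents):
    """Run extract -> cleanse -> tag over a dict of {name: text}."""
    return [tag(cleanse(f)) for name, raw in documents.items() for f in extract(name, raw)]


def query(fragments, label):
    """Return fragments carrying the label, regardless of source document."""
    return [f for f in fragments if label in f.tags]


docs = {
    "engagement_a.txt": "Project overview.\n\nKey risk: vendor delay. Contact jane@example.com.",
    "engagement_b.txt": "Next milestone is in March.\n\nRisk mitigation plan attached.",
}
fragments = harvest(docs)
for f in query(fragments, "risk"):
    print(f.source, "|", f.text)
```

Because tags are attached to fragments rather than to whole files, a query such as `query(fragments, "risk")` aggregates matching passages across documents, which mirrors the source-independent search the abstract describes.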

Similar resources

Towards a Process-Oriented Approach to Assessing, Classifying and Visualizing Enterprise Content with Document Maps

Nowadays, documents can be scattered across a company in different versions, formats, and languages, and even on different systems. Not only is the resulting content chaos inefficient, it brings with it a number of risks. However, information that is contained in unstructured documents is increasingly becoming a key business resource. Enterprise content management (ECM) is used to manage unstru...

Measuring Similarity between XML Documents

With the advance of World Wide Web standards, XML documents have become popular in e-business applications for information exchange. Electronic catalogs and transaction records are now formatted in XML. XML documents are semi-structured documents with XML schemas marking up the semantics. XML separates presentation from semantics so that presentation of information on different devices can be proces...

CLUSTERING HYPERTEXT WITH APPLICATIONS TO WEB SEARCHING

This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside...

Context-free Grammar Learning from Text Document using Sequential Pattern

The World Wide Web and information systems have made significant advances over the last two decades, as shown by their dominance in various business and scientific applications. As estimated by Blumberg and Atre, more than 85% of all business information exists in the form of unstructured and semi-structured documents, typically formatted for human viewing, not for system processing. Extracti...

Analytical Comparison of Business Network and Business Ecosystem

The emergence of new concepts in the business field of study, alongside increasing environmental changes, has made it more complex. These continuous changes show the importance of theoretical and practical readiness in relation to environmental issues. Two topics that have attracted the attention of both academics and practitioners are “Business Network” and “Business Ecosystem”. This paper tries to scrutinize...

Publication date: 2009